Method and Apparatus for Utilizing Structural Information in Semi-Structured Documents to Generate Candidates for Question Answering Systems

ABSTRACT

An approach to candidate answer generation by leveraging structural information in semi-structured resources, such as the title of a document and anchor texts in a document.

FIELD OF THE INVENTION

The invention relates generally to question answering systems used to generate candidate answers by consulting a possibly heterogeneous collection of structured, semi-structured, and unstructured information resources.

RELATED ART

Most question answering (QA) systems suffer from two significant deficiencies. First, the systems rely on the question analysis component correctly identifying the semantic type of the answer and the named entity recognizer correctly identifying the correct answer as that semantic type. Failure at either stage produces an error from which the system cannot recover.

Second, most QA systems are not amenable to questions without answer types, such as “What was the Parthenon converted into in 1460?” For such questions, oftentimes all noun phrases from the search output are extracted, leading to a large number of extraneous and at times non-sensible candidate answers in the context of the question.

SUMMARY

The present invention provides an approach to candidate answer generation by leveraging structural information in semi-structured resources, such as the title of a document and anchor texts in a document.

In one aspect, the invention provides a method for candidate generation for question answering including receiving a natural language question and formulating queries used to retrieve search results including documents and passages that are relevant to answering the natural language question; extracting from the search results potential answers to the natural language question; and scoring and ranking the answers to produce a final ranked list of answers with associated confidence scores.

The method for candidate generation for question answering includes receiving at least one document or passage together with its provenance information; accessing a semi-structured source of information based on the provenance; retrieving substructures/entities including a title of a document and anchor text from the passage within the document; applying a normalization operation, such as replacing the HTML symbol “&nbsp;” with a space character or removing the disambiguation field in Wikipedia article titles (e.g. removing the text in parentheses for the title Titanic (1997 film)), to the substructure/entity (e.g. titles and anchor texts); and returning the resulting list of candidate answers.
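The normalization operation described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the function name `normalize_candidate`, the regular expression, and the final whitespace cleanup are assumptions introduced here.

```python
import html
import re

# Hypothetical pattern (an assumption): a single parenthetical group at the
# very end of the string is treated as the disambiguation field.
_DISAMBIGUATION = re.compile(r"\s*\([^()]*\)\s*$")


def normalize_candidate(text: str) -> str:
    """Normalize a title or anchor text into a candidate answer string."""
    # Decode HTML entities; "&nbsp;" becomes the non-breaking space U+00A0,
    # which is then replaced with an ordinary space character.
    text = html.unescape(text).replace("\u00a0", " ")
    # Remove a trailing disambiguation field such as " (1997 film)".
    text = _DISAMBIGUATION.sub("", text)
    # Collapse repeated whitespace (an extra cleanup step, also an assumption).
    return " ".join(text.split())
```

Only a trailing parenthetical is removed in this sketch, so parentheses that appear elsewhere in a title are left untouched.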

The approach improves upon previous generation methods by producing candidate answers in a context-dependent fashion without requiring high accuracy in answer type detection and named entity recognition. The approach is applicable to questions with both definitive semantic answer types as well as untyped questions, and in the latter case, improves overall system efficiency by generating a significantly smaller set of candidate answers through leveraging context-dependent structural information.

The objects and features of the present invention, which are believed to be novel, are set forth with particularity in the appended claims. The present invention, both as to its organization and manner of operation, together with further objects and advantages, may best be understood by reference to the following description, taken in connection with the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

The foregoing features and other features of the present invention will now be described with reference to the drawings. In the drawings, the same components have the same reference numerals. The illustrated embodiment is intended to illustrate, but not to limit the invention. The drawings include the following Figures:

FIG. 1 illustrates the components of a canonical question answering system and its workflow;

FIG. 2 is a flow diagram illustrating an approach to candidate answer generation by leveraging structural information in semi-structured resources, such as the title of a document; and

FIG. 3 is a flow diagram illustrating an approach to candidate answer generation by leveraging structural information in semi-structured resources, such as anchor texts in a document.

DETAILED DESCRIPTION

The present invention may be described herein in terms of various components and processing steps. It should be appreciated that such components and steps may be realized by any number of hardware and software components configured to perform the specified functions. For example, the present invention may employ various electronic control devices, visual display devices, input terminals and the like, which may carry out a variety of functions under the control of one or more control systems, microprocessors or other control devices.

In addition, the present invention may be practiced in any number of contexts and the exemplary embodiments relating to a searching system and method as described herein are merely a few of the exemplary applications for the invention. The processing steps may be conducted with one or more computer-based systems through the use of one or more algorithms.

FIG. 1 illustrates the components of a QA system 100 and its workflow, including question analysis component 104, search component 106, candidate generation component 108 and answer selection component 110.

In operation, the question analysis component 104 receives a natural language question 102, for example, “Who is the 42nd president of the United States?” Question analysis component 104 analyzes the question to produce, minimally, the semantic type of the expected answer (in this example, “president”), and optionally other analysis results for downstream processing.

The search component 106 formulates queries from the output of question analysis and consults various resources, for example, the World Wide Web and databases 107, to retrieve documents, passages, database tuples, and the like, that are relevant to answering the question.

The candidate generation component 108 then extracts from the search results potential answers to the question, which are then scored and ranked by the answer selection component 110 to produce a final ranked list of answers 112 with associated confidence scores.
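The workflow of FIG. 1 can be pictured as a chain of four stages. The sketch below is illustrative only: the function `run_qa_pipeline`, the `ScoredAnswer` type, and the callable parameters are hypothetical names standing in for components 104, 106, 108 and 110, whose internals are not specified here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class ScoredAnswer:
    text: str
    confidence: float


def run_qa_pipeline(
    question: str,
    analyze: Callable[[str], Any],                  # stands in for component 104
    search: Callable[[Any], List[Any]],             # stands in for component 106
    generate_candidates: Callable[[List[Any]], List[str]],  # component 108
    score: Callable[[str, str], float],             # scoring part of component 110
) -> List[ScoredAnswer]:
    """Run the four stages in order and return answers ranked by confidence."""
    analysis = analyze(question)
    results = search(analysis)
    candidates = generate_candidates(results)
    scored = [ScoredAnswer(c, score(question, c)) for c in candidates]
    # Highest confidence first, corresponding to the final ranked list 112.
    return sorted(scored, key=lambda a: a.confidence, reverse=True)
```

Passing each stage as a callable keeps the sketch independent of any particular search engine or scoring model.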

Candidate generation component 108 is an important component in question answering systems in which potential answers to a given question are extracted from the search results. In a typical question answering system, candidate answers are identified based on the semantic type match between the answer type as determined by the question analysis component 104 and entities extracted from the search results via a named entity recognizer. For example, for the sample question “Who is the 42nd president of the United States?” all candidate answers will be of the semantic type US president.

FIGS. 2 and 3 are flow diagrams illustrating an approach to candidate answer generation by leveraging structural information in semi-structured resources, such as the title of a document and anchor texts in a document. This approach improves upon previous candidate generation methods by producing candidate answers in a context-dependent fashion without requiring high accuracy in answer type detection and named entity recognition.

This approach is applicable to questions with both definitive semantic answer types as well as untyped questions, and improves overall system efficiency by generating a significantly smaller set of candidate answers through leveraging context-dependent structural information.

In certain types of documents, such as encyclopedia articles and the like, the document title is an excellent candidate answer for properties described in the article about the title entity. For example, in a document about the band “The First Edition”, the following facts are provided: “The First Edition was a rock band, stalwart members being Kenny Rogers, Mickey Jones, and Terry Williams. The band formed in 1967, with noted folk musicians Mike Settle and the operatically trained Thelma Camacho completing the lineup” and “The First Edition were (outside of Mickey Jones) made up of former New Christy Minstrels who felt creatively stifled.” Given the question “What is the rock band formed by Kenny Rogers and other members of the New Christy Minstrels in 1967?” the search component 106 of QA system 100 is likely to include a document 202, for example “The First Edition”, or passage texts 204 extracted from document 202, among its search results.

In one embodiment of the present invention, candidate generation component 108 performs document title approach 200 by extracting candidate answers from search results. If the search results include documents 202, then the “title field” of these documents, such as “The First Edition”, is extracted using title retrieval component 208 as illustrated in FIG. 2. One implementation of title approach 200 may be done through a database table lookup. If the search results include passage texts 204, then the documents 202a that contain the passage texts are retrieved through document retrieval component 206. This may be accomplished by retrieving the provenance information of the passage texts. Once the documents 202a containing the passage texts are obtained, the titles may be retrieved using title retrieval component 208 and are used as candidate answers 210.

In the event that search component 106 returns document 202 as its search result, title retrieval component 208 returns the title of document 202 as a candidate answer 210.

In the event that search component 106 returns a passage 204 (for example, a short 1-3 sentence text snippet), then a document 202a from which passage 204 has been extracted is searched for and identified using document retrieval component 206.

Document retrieval component 206 is configured to match passage 204 against a set of free-text records. These records could be any type of mainly unstructured text, such as newspaper articles, real estate records or paragraphs in a manual. Passages 204 may range from multi-sentence full descriptions of an information need to a few words.

Once document 202a has been identified, title retrieval component 208 returns the title of document 202a as a candidate answer 210.
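The document title approach can be sketched roughly as follows. All names are hypothetical: `SearchResult` models a search result with optional provenance, the `titles` dictionary stands in for the database table lookup, and `find_document` stands in for document retrieval component 206. Skipping duplicate titles is an added assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class SearchResult:
    kind: str                      # "document" or "passage"
    doc_id: Optional[str] = None   # provenance information, if available
    text: str = ""


def title_candidates(
    results: List[SearchResult],
    titles: Dict[str, str],                          # doc ID -> title
    find_document: Callable[[str], Optional[str]],   # passage text -> doc ID
) -> List[str]:
    """Return the titles of the source documents as candidate answers."""
    candidates: List[str] = []
    for result in results:
        doc_id = result.doc_id
        # A passage without provenance is matched back to its document.
        if doc_id is None and result.kind == "passage":
            doc_id = find_document(result.text)
        if doc_id is None:
            continue
        title = titles.get(doc_id)
        # Skip unknown documents and titles that were already collected.
        if title and title not in candidates:
            candidates.append(title)
    return candidates
```

A document result and a passage from the same document therefore contribute a single title candidate.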

In another embodiment, candidate generation component 108 includes anchor text retrieval approach 300, which leverages anchor texts found in a passage/document 302 to extract candidate answers 210 from text retrieved from passage/document 302. Anchor texts are text strings highlighted in a document to indicate hyperlinks to other documents.

As illustrated in FIG. 3, candidate generation component 108 uses anchor texts 304 as a candidate generation mechanism in QA system 100. For each document-oriented search result (i.e. passage or document 302) from search component 106, retrieve component 306 identifies the document from which the retrieved text has been extracted. Retrieve component 306 then retrieves all anchor texts 304 that are present in document 302. An implementation of using anchor texts 304 may be through a database lookup in which the database stores pairs of a document ID and a list of all anchor texts in that particular document. Given a search result, such as passage/document 302, the document ID of passage/document 302 may either be obtained through provenance information in the search result, if available, or retrieved again through retrieve component 306. Once the document ID is identified, the list of anchor texts 304 in that document may be obtained through a simple database query. The subset of the list of anchor texts that are present in passage/document 302 is selected as candidate answers 210.

Next, in match component 308, anchor texts 304 are matched against the retrieved text and all anchor texts 304 that are present in the retrieved text are selected as candidate answers 210.
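The anchor text lookup and matching steps can be sketched as shown below. The function name, the dictionary standing in for the database of document ID and anchor text list pairs, and the case-insensitive substring test used for matching are all assumptions made for illustration.

```python
from typing import Dict, List


def anchor_text_candidates(
    retrieved_text: str,
    doc_id: str,
    anchors_by_doc: Dict[str, List[str]],  # doc ID -> all anchor texts in it
) -> List[str]:
    """Select the document's anchor texts that occur in the retrieved text."""
    # Lookup step: fetch every anchor text stored for the source document.
    anchors = anchors_by_doc.get(doc_id, [])
    # Matching step: keep the anchors present in the retrieved text.
    # Case-insensitive substring matching is an assumption of this sketch.
    haystack = retrieved_text.casefold()
    candidates: List[str] = []
    for anchor in anchors:
        if anchor.casefold() in haystack and anchor not in candidates:
            candidates.append(anchor)
    return candidates
```

When the search result is a whole document, the retrieved text is the document itself, so every stored anchor text that literally appears in it is selected.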

It should be understood that the approaches described above regarding FIGS. 2 and 3 to candidate answer generation by leveraging structural information in semi-structured resources, such as the title of a document and anchor texts in a document, may be used separately or may be combined. For example, in one embodiment, anchor text retrieval approach 300, in which candidate generation component 108 uses anchor texts 304 as a candidate generation mechanism in QA system 100, may be applied to the candidate answers 210 extracted using document title approach 200. This is performed by treating each extracted candidate answer 210 from approach 200 as a search result (i.e. passage/document 302) and further extracting anchor text sub-candidates from within candidate answers 210. For example, for candidate answer 210 “List of Deserts in Australia”, the candidate answer 210 “Australia” may be generated.
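One way to picture the combined embodiment is to match a list of known anchor texts against each title candidate. The sketch below is an assumption-laden illustration: the function name, the source of `known_anchor_texts`, and the whole-word, case-insensitive matching rule are not specified in the description.

```python
import re
from typing import List


def anchor_sub_candidates(candidate: str, known_anchor_texts: List[str]) -> List[str]:
    """Extract anchor text sub-candidates found inside a title candidate."""
    found: List[str] = []
    for anchor in known_anchor_texts:
        # The candidate itself is not a sub-candidate.
        if anchor == candidate:
            continue
        # Whole-word match, so "Desert" does not match inside "Deserts".
        pattern = r"(?<!\w)" + re.escape(anchor) + r"(?!\w)"
        if re.search(pattern, candidate, flags=re.IGNORECASE) and anchor not in found:
            found.append(anchor)
    return found
```

With the title “List of Deserts in Australia” and a hypothetical anchor list containing “Australia”, the function yields “Australia” as a sub-candidate.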

The invention has been disclosed in an illustrative manner. Accordingly, the terminology employed throughout should be read in an exemplary rather than a limiting manner. Although minor modifications of the invention will occur to those of ordinary skill in the art, it shall be understood that what is intended to be circumscribed within the scope of the patent warranted hereon are all such embodiments that reasonably fall within the scope of the advancement to the art hereby contributed, and that scope shall not be restricted, except in light of the appended claims and their equivalents.

1. A method for candidate generation for question answering comprising: receiving a natural language question and formulating queries used to retrieve search results including documents and passages that are relevant to answering the natural language question; receiving at least one document or passage together with its provenance information; accessing a semi-structured source of information based on the provenance; retrieving substructures/entities including a title of a document and anchor texts from the passage within the document; extracting from the search results candidate answers to the natural language question; selecting the title and the anchor texts as candidate answers; applying a normalization operation to the substructure/entity; scoring and ranking the answers to produce a final ranked list of candidate answers with associated confidence scores; and returning the resulting list of the candidate answers.